ON INFORMATION PLUS NOISE KERNEL RANDOM MATRICES

By Noureddine El Karoui

University of California, Berkeley 

Kernel random matrices have attracted a lot of interest in recent 
years, from both practical and theoretical standpoints. Most of the 
theoretical work so far has focused on the case where the data is
sampled from a low-dimensional structure. Very recently, the first results
concerning kernel random matrices with high-dimensional input data 
were obtained, in a setting where the data was sampled from a gen- 
uinely high-dimensional structure — similar to standard assumptions 
in random matrix theory. 

In this paper, we consider the case where the data is of the type
"information + noise." In other words, each observation is the sum
of two independent elements: one sampled from a "low-dimensional" 
structure, the signal part of the data, the other being high-dimensional 
noise, normalized to not overwhelm but still affect the signal. We con- 
sider two types of noise, spherical and elliptical. 

In the spherical setting, we show that the spectral properties of 
kernel random matrices can be understood from a new kernel matrix, 
computed only from the signal part of the data, but using (in gen- 
eral) a slightly different kernel. The Gaussian kernel has some special 
properties in this setting. 

The elliptical setting, which is important from a robustness stand- 
point, is less prone to easy interpretation. 

1. Introduction. Kernel techniques are now a standard tool of statisti- 
cal practice and kernel versions of many methods of classical multivariate 
statistics have now been created. A few important examples can be found in 
Scholkopf and Smola (2002) (see the description of kernel PCA, pages 41-45) 
and Bach and Jordan (2003) (for kernel ICA), for instance. There are several 
ways to describe kernel methods, but one of them is to think of them as clas- 
sical multivariate techniques using generalized notions of inner-product. A 
basic input in these techniques is a kernel matrix, that is, an inner-product
(or Gram) matrix, for generalized inner-products. If our vectors of
observations are $X_1, \ldots, X_n$, the kernel matrices studied in this paper have
$(i,j)$ entry $f(\|X_i - X_j\|_2^2)$ or $f(X_i'X_j)$, for a certain $f$. Popular examples include
the Gaussian kernel [entries $\exp(-\|X_i - X_j\|_2^2/2s^2)$], the Sigmoid kernel
[entries $\tanh(\kappa X_i'X_j + \theta)$] and polynomial kernels [entries $(X_i'X_j)^d$]. We refer
to Rasmussen and Williams (2006) for more examples. As explained in, for 
instance, Scholkopf and Smola (2002), kernel techniques allow practition- 
ers to essentially do multivariate analysis in infinite-dimensional spaces, by 
embedding the data in an infinite-dimensional space through the use of the
kernel. A nice numerical feature is that the embedding need not be speci- 
fied, and all computations can be made using the finite-dimensional kernel 
matrix. Kernel techniques also allow users to do certain forms of nonlin- 
ear data analysis and dimensionality reduction, which is naturally very de- 
sirable. Zwald, Bousquet and Blanchard (2004) and von Luxburg, Belkin 
and Bousquet (2008) are two interesting relatively recent papers concerned 
broadly speaking with the same types of inferential questions we have in 
mind and investigate in this paper, though the settings of these papers are
quite different from the one we will work under. 
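
To make the objects in this paragraph concrete, here is a minimal sketch of how such kernel matrices can be assembled. It uses only the Python standard library; the helper names (`kernel_matrix`, `gaussian_entry`, `sigmoid_entry`) and the three toy points are illustrative assumptions, not part of the paper.

```python
import math

def sq_dist(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def dot(x, y):
    """Standard inner product x'y."""
    return sum(a * b for a, b in zip(x, y))

def kernel_matrix(points, entry):
    """n x n matrix whose (i, j) entry is entry(points[i], points[j])."""
    return [[entry(x, y) for y in points] for x in points]

def gaussian_entry(s):
    # Gaussian kernel entry: exp(-||x - y||^2 / (2 s^2)); s is the bandwidth.
    return lambda x, y: math.exp(-sq_dist(x, y) / (2.0 * s ** 2))

def sigmoid_entry(kappa, theta):
    # Sigmoid kernel entry: tanh(kappa * x'y + theta).
    return lambda x, y: math.tanh(kappa * dot(x, y) + theta)

# Three toy observations in R^2 (hypothetical data).
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
G = kernel_matrix(points, gaussian_entry(s=1.0))
S = kernel_matrix(points, sigmoid_entry(kappa=0.5, theta=0.0))
```

Only pairwise quantities (distances or inner products) enter the computation, which is the sense in which the embedding never needs to be specified.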

Kernel matrices and the closely related Laplacian matrices also play a 
central role in manifold learning [see, e.g., Belkin and Niyogi (2003) and 
Izenman (2008) for an overview of various techniques]. In "classical" statis- 
tics, they have been a mainstay of spatial statistics and geostatistics in 
particular [see Cressie (1993)]. 

In geostatistical applications, it is clear that the dimension of the data is at 
most 3. Also, in applications of kernel techniques and manifold learning, it is 
often assumed that the data live on a low-dimensional manifold or structure, 
the kernel approach allowing us to somehow recover (at least partially) this 
information. Consequently, most theoretical analyses of kernel matrices and 
kernel or manifold learning techniques have focused on situations where the 
data is assumed to live on such a low-dimensional structure. In particular, it 
is often the case that asymptotics are studied under the assumption that the 
data is i.i.d. from a fixed distribution — independent of the number of points. 
Some remarkable results have been obtained in this setting [see Koltchinskii 
and Gine (2000) and also Belkin and Niyogi (2008)]. 

Let us give a brief overview of such results. In Koltchinskii and Gine 
(2000), the authors prove that if $X_i$ are i.i.d. with distribution $P$, under
regularity conditions on the kernel $k(x,y)$, the kth largest eigenvalue of the
kernel matrix $M$, with entries

$$M(i,j) = \frac{1}{n}k(X_i, X_j),$$

converges to the kth largest eigenvalue of the operator $K$ defined as

$$Kf(x) = \int k(x,y)f(y)\,dP(y).$$

In this important paper, the authors were also able to obtain fluctuation 
behavior for these eigenvalues, under certain technical conditions [see The- 
orem 5.1 in Koltchinskii and Gine (2000)]. Similar first-order convergence 
results were obtained, at a heuristic level but through interesting arguments, 
in Williams and Seeger (2000). 

These results gave theoretical confirmation to practitioners' intuition and 
heuristics that the kernel matrix could be used as a good proxy for the 
operator $K$ on $L^2(dP)$, and hence kernel techniques could be explained and
justified through the spectral properties of this operator. 

To statisticians well versed in the theory of random matrices, this set 
of results appears to be similar to results for low-dimensional covariance 
matrices stating that when the dimension of the data is fixed and the number 
of observations goes to infinity, the sample covariance matrix is a spectrally 
consistent estimator of the population covariance matrix [see, e.g., Anderson 
(2003)]. However, it is well known [see, e.g., Marcenko and Pastur (1967), 
Bai (1999), Johnstone (2007)] that this is not the case when the dimension 
of the data, p, changes with n, the number of observations, and in particular 
when asymptotics are studied under the assumption that p/n has a finite 
limit. We refer to the asymptotic setting where p and n both tend to infinity 
as the "high-dimensional" setting. We note that given that more and more 
datasets have observations that are high dimensional, and kernel techniques 
are used on some of them [see Williams and Seeger (2000)], it is natural to 
study kernel random matrices in the high-dimensional setting. 

Another important reason to study this type of asymptotics is that by 
keeping track of the effect of the dimension of the data, p, and of other 
parameters of the problem on the results, they might help us give more ac- 
curate prediction about the finite-dimensional behavior of certain statistics 
than the classical "small p, large n" asymptotics. An example of this phe- 
nomenon can be found in the paper Johnstone (2001) where it turned out 
in simulation that some of the doubly asymptotic results concerning fluctu- 
ation behavior of the largest eigenvalue of a Wishart matrix with identity 
covariance are quite accurate for p and n as small as 5 or 10, at least in the 
right tail of the distribution. [We refer the interested reader to Johnstone 
(2001) for more details on the specific example we just described.] Hence, it 
is also potentially practically important to carry out these theoretical studies 
for they can be informative even for finite-dimensional considerations. 

The properties of kernel random matrices under classical random matrix 
assumptions have been studied by the author in the recent El Karoui (2010). 
It was shown there that when the data is high dimensional, for instance 
$X_i \sim \mathcal{N}(0, \Sigma_p)$, and the operator norm of $\Sigma_p$ is, for example, bounded, kernel
random matrices essentially act like standard Gram/"covariance matrices,"
up to recentering and rescaling, which depend only on $f$. Naturally, a certain
scaling is needed to make the problem nondegenerate, and the results we just
stated hold, for instance, when $M(i,j) = f(\|X_i - X_j\|_2^2/p)$, for otherwise
the kernel matrix is in general degenerate. We refer to El Karoui (2010) for 
more details and discussions of the relevance of these results in practice. 
In limited simulations, we found that the theory agreed with the numerics 
even when $p$ was of the order of several 10's and $p/n$ was not "too small"
(e.g., $p/n \simeq 0.2$). These results came as somewhat of a surprise and seemed
to contradict the intuition and numerous positive practical results that have
been obtained, since they suggested that the kernel matrices we considered
were just a (centered and scaled) version of the matrix $XX'$. However, it
should be noted that the assumptions implied that the data was truly high 
dimensional. 

So an interesting middle ground, from modeling, theoretical and practical 
points of view is the following: what happens if the data does not live exactly 
on a fixed-dimensional manifold, but lives "nearby?" In other words, the data 
is now sampled from a "noisy" version of the manifold. This is the question 
we study in this paper. We assume now that the data points $X_i \in \mathbb{R}^p$ we
observe are of the form

$$X_i = Y_i + Z_i,$$

where $Y_i$ is the "signal" part of the observations (and lives, for instance,
on a low-dimensional manifold, e.g., a three-dimensional sphere) and $Z_i$ is
the noise part of the observations (and is, e.g., multivariate Gaussian in
dimension p, where p might be 100). 

We think this is interesting from a practical standpoint because the as- 
sumption that the data is exactly on a manifold is perhaps a bit optimistic 
and the "noisy manifold" version is perhaps more in line with what statis- 
ticians expect to encounter in practice (there is a clear analogy with linear 
regression here). From a theoretical standpoint, such a model allows us to 
bridge the two extremes between truly low-dimensional data and fully high- 
dimensional data. From a modeling standpoint, we propose to scale the noise 
so that its norm stays bounded (or does not grow too fast) in the asymp- 
totics. That way, the "signal" part of the data is likely to be affected but 
not totally drowned by the noise. It is important to note, however, that the 
noise is not "small" in any sense of the word — it is of a size comparable with 
that of the signal. 
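
The model can be simulated in a few lines. The sketch below is only an illustration under stated assumptions: the signal is taken uniform on the unit sphere of a three-dimensional subspace of $\mathbb{R}^p$, the noise is Gaussian with identity covariance scaled by $1/\sqrt{p}$ (so that its norm stays of order 1), and the function names are hypothetical.

```python
import math
import random

def sample_signal(rng, p):
    """Point on the unit sphere of the first 3 coordinates of R^p (p >= 3)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(3)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g] + [0.0] * (p - 3)

def sample_noise(rng, p):
    """Gaussian noise divided by sqrt(p): its squared norm is close to 1."""
    return [rng.gauss(0.0, 1.0) / math.sqrt(p) for _ in range(p)]

def sample_data(n, p, seed=0):
    """Return (X, Y, Z) with X_i = Y_i + Z_i, signal Y_i and scaled noise Z_i."""
    rng = random.Random(seed)
    Y = [sample_signal(rng, p) for _ in range(n)]
    Z = [sample_noise(rng, p) for _ in range(n)]
    X = [[y + z for y, z in zip(yi, zi)] for yi, zi in zip(Y, Z)]
    return X, Y, Z

X, Y, Z = sample_data(n=5, p=2000)
signal_norms = [math.sqrt(sum(v * v for v in y)) for y in Y]
noise_norms = [math.sqrt(sum(v * v for v in z)) for z in Z]
```

In this sketch both `signal_norms` and `noise_norms` are close to 1, which illustrates the point made above: the noise is not small, it is of a size comparable with that of the signal.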

In the case of spherical noise (see below for details but note that the 
Gaussian distribution falls into this category) our results say that, to first- 
order, the kernel matrix computed from information + noise data behaves 
like a kernel matrix computed from the "signal" part of the data, but we
might have to use a different kernel than the one we started with. This
other kernel is quite explicit. In the case of dot-product kernel matrices
[i.e., $M(i,j) = f(X_i'X_j)/n$], the original kernel can be used (under certain
assumptions), so, to first-order, the noise part has no effect on the spectral
properties of the kernel matrix. The results are different when looking at
Euclidean distance kernels [i.e., $M(i,j) = f(\|X_i - X_j\|_2^2)/n$], where the effect
of the noise is basically to change the kernel that is used. This is in any case 
a quite positive result in that it says that the whole body of work concerning 
the behavior of kernel random matrices with low-dimensional input data can 
be used to also study the "information + noise" case — the only change being 
a change of kernels. 

The case of elliptical noise is more complicated. The dot-product kernels 
results still have the same interpretation. But the Euclidean distance kernels 
results are not as easy to interpret. 

2. Results. Before we start, we set some notation. We use $\|M\|_F$ to
denote the Frobenius norm of the matrix $M$ [so $\|M\|_F^2 = \sum_{i,j} M^2(i,j)$] and
$|||M|||_2$ to denote its operator norm, that is, its largest singular value. We also
use $\|v\|_2$ to denote the Euclidean norm of the vector $v$. $a \vee b$ is shorthand
for max(a, b). Unless otherwise noted, functions that are said to be Lipschitz 
are Lipschitz with respect to Euclidean norm. 

We split our results into two parts, according to distributional assump- 
tions on the noise. One deals with the Gaussian-like case, which allows us to 
give a simple proof of the results. The second part is about the case where 
the noise has a distribution that satisfies certain concentration and elliptic- 
ity properties. This is more general and brings the geometry of the problem 
forward. It also allows us to study the robustness (and lack thereof) of the 
results to the sphericity of the noise, an assumption that is implicit in the 
high-dimensional Gaussian (and Gaussian-like) case. 

We draw some practical conclusions from our results for the case of spher- 
ical noise in Section 2.3. 



2.1. The case of Gaussian-like noise. We first study a setting where the
noise is drawn according to a distribution that is similar to a Gaussian, but
slightly more general.

Theorem 2.1. Suppose we observe data $X_1, \ldots, X_n$ in $\mathbb{R}^p$, with

$$X_i = Y_i + \frac{Z_i}{\sqrt{p}},$$

where $Z_i = \Sigma_p^{1/2}U_i$, where the $p$-dimensional vector $U_i$ has i.i.d. entries with
mean 0, variance 1 and fourth moment $\mu_4$, and $\{Y_i\}_{i=1}^n \sim P_n$. We assume
that there exists a deterministic vector $a$ and a real $C_1 > 0$, possibly
dependent on $n$, such that $\forall i$, $E(\|Y_i - a\|_2^2) \le C_1$. Also, $\mu_4$ might change with $n$
but is assumed to remain bounded.

$\{Z_i\}_{i=1}^n$ are i.i.d., and we also assume that $\{Y_i\}_{i=1}^n$ and $\{Z_j\}_{j=1}^n$ are
independent.

We consider the random matrices $M_f$ with $(i,j)$ entry

$$M_f(i,j) = \frac{1}{n}f(\|X_i - X_j\|_2^2) \quad \text{for functions } f \in \mathcal{F}_{C_0(n)},$$

where

$$\mathcal{F}_{C_0(n)} = \{f \text{ such that } |f(x) - f(y)| \le C_0(n)|x - y| \text{ for all } x, y\}.$$

Let us call $\nu = \operatorname{trace}(\Sigma_p)/p$. Let $\widetilde{M}_f$ be the matrix with $(i,j)$th entry

$$\widetilde{M}_f(i,j) = \begin{cases} \dfrac{1}{n}f(\|Y_i - Y_j\|_2^2 + 2\nu), & \text{if } i \neq j,\\[4pt] \dfrac{1}{n}f(0), & \text{if } i = j. \end{cases}$$

Assuming only that $\mu_4$ is bounded uniformly in $n$, we have, for a constant
$C$ independent of $n$, $p$ and $\Sigma_p$,

$$E^*\Bigl(\sup_{f \in \mathcal{F}_{C_0(n)}} \|M_f - \widetilde{M}_f\|_F^2\Bigr) \le C\,C_0^2(n)\Bigl[\frac{\operatorname{trace}(\Sigma_p^2)}{p^2} + \frac{|||\Sigma_p|||_2}{p}\,C_1\Bigr]. \tag{1}$$

We place ourselves in the high-dimensional setting where $n$ and $p$ tend to
infinity. We assume that $\operatorname{trace}(\Sigma_p^2)/p^2 \to 0$, as $p$ tends to infinity.
Under these assumptions, for any fixed $C_0 > 0$ and $C_1 > 0$,

$$\lim_{n,p \to \infty}\ \sup_{f \in \mathcal{F}_{C_0}} \|M_f - \widetilde{M}_f\|_F = 0 \quad \text{in probability}.$$

If we further assume that $\nu$ remains, for instance, bounded, the same
result holds if we replace the diagonal of $\widetilde{M}_f$ by $f(2\nu)/n$, because
$|f(2\nu) - f(0)| \le 2\nu C_0$ and therefore $\sup_{f \in \mathcal{F}_{C_0}} |f(2\nu) - f(0)| \le 2\nu C_0$. The
approximating matrix we then get is the matrix with $(i,j)$th entry $\frac{1}{n}f_\nu(\|Y_i - Y_j\|_2^2)$,
where $f_\nu(x) = f(x + 2\nu)$, that is, a "pure signal" matrix involving a different
kernel from the one with which we started.
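
The statement of Theorem 2.1 can be probed numerically. The following is a small sketch rather than a definitive experiment; it assumes Gaussian noise with $\Sigma_p = \mathrm{Id}_p$ (so $\nu = 1$), a signal on the unit sphere of a three-dimensional subspace, and the Gaussian kernel $f(x) = \exp(-x/2)$. The sample sizes and function names are arbitrary choices made for illustration.

```python
import math
import random

def sq_dist(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def off_diagonal_gap(X, Y, f, shift):
    """Frobenius norm of M_f minus a pure-signal proxy.

    M_f(i, j) = f(||X_i - X_j||^2) / n is compared with
    f(||Y_i - Y_j||^2 + shift) / n for i != j; both matrices have
    f(0) / n on the diagonal, so the diagonal contributes nothing.
    """
    n = len(X)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                diff = f(sq_dist(X[i], X[j])) - f(sq_dist(Y[i], Y[j]) + shift)
                total += (diff / n) ** 2
    return math.sqrt(total)

def simulate(n=20, p=2000, seed=0):
    rng = random.Random(seed)
    Y = []
    for _ in range(n):
        g = [rng.gauss(0.0, 1.0) for _ in range(3)]
        r = math.sqrt(sum(v * v for v in g))
        Y.append([v / r for v in g] + [0.0] * (p - 3))  # signal on a unit sphere
    # X_i = Y_i + Z_i / sqrt(p) with Z_i ~ N(0, Id_p), hence nu = trace(Id_p) / p = 1.
    X = [[y + rng.gauss(0.0, 1.0) / math.sqrt(p) for y in yi] for yi in Y]
    f = lambda x: math.exp(-x / 2.0)  # Gaussian kernel with s = 1
    nu = 1.0
    with_shift = off_diagonal_gap(X, Y, f, 2.0 * nu)  # proxy from the theorem
    without_shift = off_diagonal_gap(X, Y, f, 0.0)    # original kernel on the signal
    return with_shift, without_shift

gap_shifted, gap_unshifted = simulate()
```

In this setup `gap_shifted` should be much smaller than `gap_unshifted`: the pure-signal matrix is a good proxy only once the kernel is shifted by $2\nu$.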

We note that there is a potential measurability issue that we address in
the proof. Our theorem really means that we can find a random variable that
dominates the "random element" $\sup_{f \in \mathcal{F}_{C_0}} \|M_f - \widetilde{M}_f\|_F$ and goes to 0 in
probability. (This measurability issue could also be addressed through
separability arguments but outer-probability statements suffice for our purposes
in this paper.)
A subcase of our result is the case of Gaussian noise: then $U_i$ is $\mathcal{N}(0, \mathrm{Id}_p)$
and our result naturally applies.

We also note that $P_n$ can change with $n$. The class of functions we
consider is fixed in the last statement of the theorem but if we were to look
at a sequence of kernels we could pick a different function in the class
$\mathcal{F}_{C_0}$ for each $n$ [the proof also applies to matrices with entries $M(i,j) =
f_{i,j}(\|X_i - X_j\|_2^2)/n$, where the functions considered also depend on $(i,j)$,
but we present the results with a function $f$ common to all entries]. It should
also be noted that the proof technique allows us to deal with classes of
functions that vary with $n$: we could have a varying $C_0(n)$. As (1) makes clear,
the approximation result will hold as soon as the right-hand side of (1) goes
to 0 asymptotically, that is, $C_0^2(n)\max(\operatorname{trace}(\Sigma_p^2)/p^2, |||\Sigma_p|||_2/p) \to 0$. Finally,
we work here with uniformly Lipschitz functions. The proof technique
carries over to other classes, such as certain classes of Hölder functions, but the
bounds would be different.



Proof of Theorem 2.1. The strategy is to use the same entry-wise
expansion approach that was used in El Karoui (2010). To do so, we remark
that $\|Z_i - Z_j\|_2^2/p$ remains essentially constant [across $(i,j)$] in the setting
we are considering; this is a consequence of the "spherical" nature of
high-dimensional Gaussian distributions. We can therefore try to approximate
$M(i,j)$ by $f(\|Y_i - Y_j\|_2^2 + 2\nu)/n$ and all we need to do is to show that the
remainder is small.

We also note that if, as we assume, $\operatorname{trace}(\Sigma_p^2)/p^2 \to 0$, then $|||\Sigma_p|||_2 = o(p)$,
since $|||\Sigma_p|||_2^2 \le \operatorname{trace}(\Sigma_p^2)$.

• Work conditional on $\mathcal{Y}_n = \{Y_i\}_{i=1}^n$, for $i \neq j$.

We clearly have

$$\|X_i - X_j\|_2^2 = \|Y_i - Y_j\|_2^2 + 2\frac{(Z_i - Z_j)'(Y_i - Y_j)}{\sqrt{p}} + \frac{\|Z_i - Z_j\|_2^2}{p}.$$

Let us study the various parts of this expansion. Conditional on $\mathcal{Y}_n$, if we
call $y_{i,j} = Y_i - Y_j$, we see easily that

$$\|Z_i - Z_j\|_2^2 = (U_i - U_j)'\Sigma_p(U_i - U_j)$$

and

$$(Z_i - Z_j)'(Y_i - Y_j) = (U_i - U_j)'\Sigma_p^{1/2}y_{i,j}.$$

Note that $U_i - U_j$, which we denote $\Gamma_{i,j}$, has i.i.d. entries, with mean 0,
variance 2 and fourth moment $2\mu_4 + 6$. We call

$$\alpha_{i,j} = \frac{(Z_i - Z_j)'(Y_i - Y_j)}{\sqrt{p}}$$

and

$$\beta_{i,j} = \frac{\|Z_i - Z_j\|_2^2}{p} - 2\frac{\operatorname{trace}(\Sigma_p)}{p}.$$

With this notation, we have

$$\|X_i - X_j\|_2^2 - (\|Y_i - Y_j\|_2^2 + 2\nu) = 2\alpha_{i,j} + \beta_{i,j}.$$

Therefore, for any function $f$ in $\mathcal{F}_{C_0(n)}$,

$$|f(\|X_i - X_j\|_2^2) - f(\|Y_i - Y_j\|_2^2 + 2\nu)| \le C_0(n)|\beta_{i,j} + 2\alpha_{i,j}|,$$

and hence,

$$[f(\|X_i - X_j\|_2^2) - f(\|Y_i - Y_j\|_2^2 + 2\nu)]^2 \le 2C_0^2(n)[\beta_{i,j}^2 + 4\alpha_{i,j}^2].$$

We naturally also have

$$\sup_{f \in \mathcal{F}_{C_0(n)}} [f(\|X_i - X_j\|_2^2) - f(\|Y_i - Y_j\|_2^2 + 2\nu)]^2 \le 2C_0^2(n)[\beta_{i,j}^2 + 4\alpha_{i,j}^2].$$

So we have found a random variable $r_n = 2C_0^2(n)[\beta_{i,j}^2 + 4\alpha_{i,j}^2]$ that dominates
the random element $\zeta_n = \sup_{f \in \mathcal{F}_{C_0(n)}} [f(\|X_i - X_j\|_2^2) - f(\|Y_i - Y_j\|_2^2 + 2\nu)]^2$.
One might be concerned about the measurability of $\zeta_n$, but by using outer
expectations [see van der Vaart (1998), page 258], we can completely
bypass this potential problem. In what follows, we denote by $E^*(\cdot)$ an outer
expectation. (Though this technical point does not shed further light on the
problem, it naturally needs to be addressed.)
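Hence,

$$E^*\Bigl(\sup_{f \in \mathcal{F}_{C_0(n)}} [f(\|X_i - X_j\|_2^2) - f(\|Y_i - Y_j\|_2^2 + 2\nu)]^2 \Bigm| \mathcal{Y}_n\Bigr) \le 2C_0^2(n)\bigl(E(\beta_{i,j}^2) + E(4\alpha_{i,j}^2 \mid \mathcal{Y}_n)\bigr).$$

Let us focus on $E(\beta_{i,j}^2)$ for a moment. Let us call $\Gamma_{i,j} = U_i - U_j$. We first
note that $\|Z_i - Z_j\|_2^2 = \Gamma_{i,j}'\Sigma_p\Gamma_{i,j} = \operatorname{trace}(\Sigma_p\Gamma_{i,j}\Gamma_{i,j}')$. In particular,

$$E(\|Z_i - Z_j\|_2^2) = 2\operatorname{trace}(\Sigma_p),$$

so $E(\beta_{i,j}) = 0$. Therefore, $E(\beta_{i,j}^2) = \operatorname{var}(\|Z_i - Z_j\|_2^2)/p^2$. Now recall the
results found, for instance, in Lemma A-1 in El Karoui (2010): if the vector $\gamma$
has i.i.d. entries with mean 0, variance $\sigma^2$ and fourth moment $\kappa_4$, and if $M$
is a symmetric matrix,

$$E((\gamma'M\gamma)^2) = \sigma^4(2\operatorname{trace}(M^2) + \operatorname{trace}(M)^2) + (\kappa_4 - 3\sigma^4)\operatorname{trace}(M \circ M),$$

where $M \circ M$ is the Hadamard product of $M$ with itself, that is, the entrywise
product of two matrices.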
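
The moment identity quoted from the lemma can be checked exactly in a small case. The sketch below assumes entries equal to $\pm 1$ with probability $1/2$ (so $\sigma^2 = 1$ and $\kappa_4 = 1$) and enumerates all sign vectors in dimension 4; the matrix `M` and the function names are arbitrary illustrative choices.

```python
from itertools import product

def quad_form(M, g):
    """The quadratic form g'Mg."""
    p = len(g)
    return sum(g[i] * M[i][j] * g[j] for i in range(p) for j in range(p))

def exact_second_moment(M):
    """E((g'Mg)^2) for g with i.i.d. +/-1 entries, by full enumeration."""
    p = len(M)
    values = [quad_form(M, g) ** 2 for g in product((-1.0, 1.0), repeat=p)]
    return sum(values) / len(values)

def lemma_formula(M, sigma2, kappa4):
    """sigma^4 (2 tr(M^2) + tr(M)^2) + (kappa4 - 3 sigma^4) tr(M o M)."""
    p = len(M)
    tr = sum(M[i][i] for i in range(p))
    tr_sq = sum(M[i][j] * M[j][i] for i in range(p) for j in range(p))  # trace(M^2)
    tr_had = sum(M[i][i] ** 2 for i in range(p))                        # trace(M o M)
    return sigma2 ** 2 * (2.0 * tr_sq + tr ** 2) + (kappa4 - 3.0 * sigma2 ** 2) * tr_had

# A symmetric 4 x 4 matrix (hypothetical example).
M = [[2.0, 1.0, 0.0, 0.5],
     [1.0, 3.0, -1.0, 0.0],
     [0.0, -1.0, 1.0, 2.0],
     [0.5, 0.0, 2.0, 4.0]]
lhs = exact_second_moment(M)
rhs = lemma_formula(M, sigma2=1.0, kappa4=1.0)
```

For this matrix both sides equal 125, since $\operatorname{trace}(M) = 10$, $\operatorname{trace}(M^2) = 42.5$ and $\operatorname{trace}(M \circ M) = 30$.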
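Applying this result in our setting [i.e., using the moments (given above)
of $\Gamma_{i,j}$, which has i.i.d. entries, in the previous formula] gives

$$\operatorname{var}(\|Z_i - Z_j\|_2^2) = \operatorname{var}(\Gamma_{i,j}'\Sigma_p\Gamma_{i,j}) = 8\operatorname{trace}(\Sigma_p^2) + 2(\mu_4 - 3)\operatorname{trace}(\Sigma_p \circ \Sigma_p).$$

It is easy to see that $\operatorname{trace}(\Sigma_p \circ \Sigma_p) \le \operatorname{trace}(\Sigma_p^2)$, since $\operatorname{trace}(\Sigma_p^2) = \sum_{i,j}\sigma_p^2(i,j)$
and $\operatorname{trace}(\Sigma_p \circ \Sigma_p) = \sum_i \sigma_p^2(i,i)$. Therefore,

$$E(\beta_{i,j}^2) = \frac{\operatorname{var}(\|Z_i - Z_j\|_2^2)}{p^2} \le \frac{8 + 2(\mu_4 - 3)}{p^2}\operatorname{trace}(\Sigma_p^2) = O\Bigl(\frac{\operatorname{trace}(\Sigma_p^2)}{p^2}\Bigr).$$

We note that under our assumptions on $\operatorname{trace}(\Sigma_p^2)/p^2$ and the fact that $\mu_4$
remains bounded in $n$ (and therefore $p$), this term will go to 0 as $p \to \infty$.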
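
For readers who want the intermediate step, the variance formula can be recovered as follows. This is a worked sketch that only substitutes the moments of the entries of $\Gamma_{i,j} = U_i - U_j$ (variance $\sigma^2 = 2$, fourth moment $\kappa_4 = 2\mu_4 + 6$) into the lemma with $M = \Sigma_p$, and uses $E(\Gamma_{i,j}'\Sigma_p\Gamma_{i,j}) = 2\operatorname{trace}(\Sigma_p)$.

```latex
\begin{aligned}
E\bigl((\Gamma_{i,j}'\Sigma_p\Gamma_{i,j})^2\bigr)
  &= 4\bigl(2\operatorname{trace}(\Sigma_p^2) + \operatorname{trace}(\Sigma_p)^2\bigr)
     + (2\mu_4 + 6 - 12)\operatorname{trace}(\Sigma_p \circ \Sigma_p),\\
\operatorname{var}(\Gamma_{i,j}'\Sigma_p\Gamma_{i,j})
  &= E\bigl((\Gamma_{i,j}'\Sigma_p\Gamma_{i,j})^2\bigr) - \bigl(2\operatorname{trace}(\Sigma_p)\bigr)^2\\
  &= 8\operatorname{trace}(\Sigma_p^2) + 2(\mu_4 - 3)\operatorname{trace}(\Sigma_p \circ \Sigma_p).
\end{aligned}
```

The $4\operatorname{trace}(\Sigma_p)^2$ terms cancel, which is why only $\operatorname{trace}(\Sigma_p^2)$ and $\operatorname{trace}(\Sigma_p \circ \Sigma_p)$ remain.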

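On the other hand, because $\alpha_{i,j} \mid \mathcal{Y}_n = \Gamma_{i,j}'\Sigma_p^{1/2}y_{i,j}/\sqrt{p}$, and because $E(\Gamma_{i,j}) = 0$
and $\operatorname{cov}(\Gamma_{i,j}) = 2\mathrm{Id}_p$, we have

$$E(\alpha_{i,j}^2 \mid \mathcal{Y}_n) = 2\frac{y_{i,j}'\Sigma_p y_{i,j}}{p} \le 2|||\Sigma_p|||_2\frac{\|Y_i - Y_j\|_2^2}{p} \le 4|||\Sigma_p|||_2\frac{\|Y_i - a\|_2^2 + \|Y_j - a\|_2^2}{p}.$$

Hence, we have for $C$ a constant independent of $\Sigma_p$, $p$ and $n$,

$$E^*\Bigl(\sup_{f \in \mathcal{F}_{C_0(n)}} [f(\|X_i - X_j\|_2^2) - f(\|Y_i - Y_j\|_2^2 + 2\nu)]^2 \Bigm| \mathcal{Y}_n\Bigr) \le C\,C_0^2(n)\Bigl[\frac{\operatorname{trace}(\Sigma_p^2)}{p^2} + |||\Sigma_p|||_2\frac{\|Y_i - a\|_2^2 + \|Y_j - a\|_2^2}{p}\Bigr].$$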

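This inequality allows us to conclude that, for another constant $C$,

$$E^*\Bigl(\sup_{f \in \mathcal{F}_{C_0(n)}} \|M_f - \widetilde{M}_f\|_F^2 \Bigm| \mathcal{Y}_n\Bigr) \le C\,C_0^2(n)\Bigl[\frac{\operatorname{trace}(\Sigma_p^2)}{p^2} + \frac{|||\Sigma_p|||_2}{p}\,\frac{1}{n}\sum_{i=1}^n \|Y_i - a\|_2^2\Bigr],$$

since clearly,

$$\sup_{f \in \mathcal{F}_{C_0(n)}} \|M_f - \widetilde{M}_f\|_F^2 \le \frac{1}{n^2}\sum_{i \neq j}\ \sup_{f \in \mathcal{F}_{C_0(n)}} [f(\|X_i - X_j\|_2^2) - f(\|Y_i - Y_j\|_2^2 + 2\nu)]^2.$$

Under the assumption that $E(\|Y_i - a\|_2^2)$ exists and is less than $C_1$, we
finally conclude that

$$E^*\Bigl(\sup_{f \in \mathcal{F}_{C_0(n)}} \|M_f - \widetilde{M}_f\|_F^2\Bigr) \le C\,C_0^2(n)\Bigl[\frac{\operatorname{trace}(\Sigma_p^2)}{p^2} + \frac{|||\Sigma_p|||_2}{p}\,C_1\Bigr],$$

and (1) is shown.

Therefore, under our assumptions,

$$E^*\Bigl(\sup_{f \in \mathcal{F}_{C_0}} \|M_f - \widetilde{M}_f\|_F^2\Bigr) = o(1).$$

Hence, when $n$ and $p$ tend to $\infty$,

$$\sup_{f \in \mathcal{F}_{C_0}} \|M_f - \widetilde{M}_f\|_F \to 0 \quad \text{in probability},$$

as announced in the theorem. □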

2.2. Case of noise drawn from a distribution satisfying concentration in- 
equalities. The proof of Theorem 2.1 makes clear that the heart of our 
argument is geometric: we exploit the fact that $\|Z_i - Z_j\|_2^2/p$ is essentially
constant across pairs $(i,j)$. It is therefore natural to try to extend the
theorem to more general assumptions about the noise distribution than the
Gaussian-like one we worked under previously. It is also important to un- 
derstand the impact of the implicit geometric assumptions (i.e., sphericity 
of the noise) that are made and in particular the robustness of our results 
against these geometric assumptions. 

We extend the results in two directions. First, we investigate the gener- 
alization of our Gaussian-like results to the setting of Euclidean-distance 
kernel random matrices, when the noise is distributed according to a dis- 
tribution satisfying a concentration inequality multiplied by a random vari- 
able, that is, a generalization of elliptical distributions. This allows us to 
show that the Gaussian-like results of Theorem 2.1 essentially hold under
much weaker assumptions on the noise distribution, as long as the Gaussian 
geometry (i.e., a spherical geometry) is preserved (see Corollary 2.3). The 
results of Theorem 2.2 show that breaking the Gaussian geometry results in 
quite different approximation results. 
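
The geometric point can be illustrated numerically. The sketch below is a toy check under assumptions of our own choosing: $Z_i \sim \mathcal{N}(0, \mathrm{Id}_p)$ (so $\nu = 1$), and deterministic radii $R_i$ with $R_i^2 \in \{0.5, 1.5\}$, which average to 1 so that the noise covariance is unchanged on average. In the spherical case $\|Z_i - Z_j\|_2^2/p$ is close to $2\nu$ for every pair, while in the elliptical case $\|R_iZ_i - R_jZ_j\|_2^2/p$ tracks $\nu(R_i^2 + R_j^2)$ and is no longer constant across pairs.

```python
import math
import random

def scaled_sq_dist(a, b, p):
    """||a - b||^2 / p for two vectors of length p."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / p

def max_deviation(noise, target, p):
    """Largest |  ||noise_i - noise_j||^2 / p - target(i, j)  | over pairs i < j."""
    n = len(noise)
    return max(abs(scaled_sq_dist(noise[i], noise[j], p) - target(i, j))
               for i in range(n) for j in range(i + 1, n))

rng = random.Random(1)
n, p = 8, 4000
Z = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]  # nu = 1
# Hypothetical radii: R_i^2 alternates between 0.5 and 1.5 (mean 1).
R = [math.sqrt(0.5) if i % 2 == 0 else math.sqrt(1.5) for i in range(n)]
RZ = [[R[i] * z for z in Z[i]] for i in range(n)]

spherical_dev = max_deviation(Z, lambda i, j: 2.0, p)
elliptical_dev = max_deviation(RZ, lambda i, j: R[i] ** 2 + R[j] ** 2, p)
elliptical_vs_constant = max_deviation(RZ, lambda i, j: 2.0, p)
```

Here `spherical_dev` and `elliptical_dev` are small, whereas `elliptical_vs_constant` is of order 1: a single constant shift no longer describes the scaled squared distances once the radii vary.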

We also discuss in Theorem 2.4 the situation of inner-product kernel ran- 
dom matrices under the same "generalized elliptical" assumptions on the 
noise. 

2.2.1. The case of Euclidean distance kernel random matrices. We have 
the following theorem. 

Theorem 2.2 (Euclidean distance kernels). Suppose we observe data
$X_1, \ldots, X_n$ in $\mathbb{R}^p$, with

$$X_i = Y_i + R_i\frac{Z_i}{\sqrt{p}}.$$

We place ourselves in the high-dimensional setting where $n$ and $p$ tend to
infinity. We assume that $\{Y_i\}_{i=1}^n \sim P_n$.

$\{Z_i\}_{i=1}^n$ are i.i.d. with $E(Z_i) = 0$, and we also assume that $\mathcal{Y}_n = \{Y_i\}_{i=1}^n$
and $\{Z_i\}_{i=1}^n$ are independent. $R_i$ are random variables independent of $\{Z_i\}_{i=1}^n$.

We now assume that the distribution of $Z_i$ is such that, for any 1-Lipschitz
function $F$, if $\mu_F = E(F(Z_i))$,

$$P(|F(Z_i) - \mu_F| > r) \le C\exp(-c_0 r^b) \triangleq h(r),$$




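where for simplicity we assume that $c_0$, $C$ and $b$ are independent of $p$. We
call $\nu = E(\|Z_i\|_2^2)/p$ and assume that $\nu$ stays bounded as $p \to \infty$.

We assume that $\forall i$, $|R_i| \in [r_\infty(p), R_\infty(p)]$, where $r_\infty(p)$ and $R_\infty(p)$ are
deterministic sequences depending on $p$. We assume without loss of
generality that $R_\infty(p) \ge 1$.

Calling $\mathcal{M}(\mathcal{Y}_n) = \max_{i \neq j}\|Y_i - Y_j\|_2^2$, we assume that there exists $\mathcal{M}_p$ such
that $P(\mathcal{M}(\mathcal{Y}_n) \le \mathcal{M}_p) \to 1$ and $\varepsilon > 0$ such that

$$\max\bigl(\sqrt{\mathcal{M}_p}, R_\infty(p)\bigr)\,\frac{R_\infty(p)}{\sqrt{p}}\,(\log n + (\log n)^{\varepsilon})^{1/b} \to 0.$$

Then we have

$$\max_{i \neq j}\bigl|\|X_i - X_j\|_2^2 - [\|Y_i - Y_j\|_2^2 + \nu(R_i^2 + R_j^2)]\bigr| \to 0 \quad \text{in probability}. \tag{2}$$

We call $\mathcal{W}(\mathcal{Y}_n) = \min_{i \neq j}\|Y_i - Y_j\|_2^2$, and suppose we pick $\mathcal{W}_p$ such that
$P(\mathcal{W}(\mathcal{Y}_n) \ge \mathcal{W}_p) \to 1$. (Note that $\mathcal{W}_p = 0$ is always a possibility.)

We call, for $\eta > 0$ given, $I_p(\eta) = [\mathcal{W}_p + 2\nu r_\infty^2(p) - \eta, \mathcal{M}_p + 2\nu R_\infty^2(p) + \eta]$,
and

$$\mathcal{F}_{C_1, I_p(\eta)} = \{f \text{ such that } |f(x) - f(y)| \le C_1|x - y| \text{ for all } x, y \in I_p(\eta)\}.$$

We consider the random matrices $M_f$ with $(i,j)$ entry

$$M_f(i,j) = \frac{1}{n}f(\|X_i - X_j\|_2^2) \quad \text{for } f \in \mathcal{F}_{C_1, I_p(\eta)}.$$

Let us call $\widetilde{M}_f$ the matrix with $(i,j)$th entry

$$\widetilde{M}_f(i,j) = \begin{cases} \dfrac{1}{n}f(\|Y_i - Y_j\|_2^2 + \nu(R_i^2 + R_j^2)), & \text{if } i \neq j,\\[4pt] \dfrac{1}{n}f(0), & \text{if } i = j. \end{cases}$$

We have, for any given $C_1 > 0$ and $\eta > 0$,

$$\lim_{n,p \to \infty}\ \sup_{f \in \mathcal{F}_{C_1, I_p(\eta)}} \|M_f - \widetilde{M}_f\|_F = 0 \quad \text{in probability}. \tag{3}$$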

We have the following corollary in the case of "spherical" noise, which is
a generalization of the Gaussian-like case considered in Theorem 2.1.

Corollary 2.3 (Euclidean distance kernels with spherical noise).
Suppose we observe data $X_1, \ldots, X_n$ in $\mathbb{R}^p$, with

$$X_i = Y_i + \frac{Z_i}{\sqrt{p}},$$

where $Y_i$ and $Z_i$ satisfy the same assumptions as in Theorem 2.2 [with
$r_\infty(p) = R_\infty(p) = 1$]. Then the results of Theorem 2.2 apply with

$$M_f(i,j) = \frac{1}{n}f(\|X_i - X_j\|_2^2)$$

and

$$\widetilde{M}_f(i,j) = \begin{cases} \dfrac{1}{n}f(\|Y_i - Y_j\|_2^2 + 2\nu), & \text{if } i \neq j,\\[4pt] \dfrac{1}{n}f(0), & \text{if } i = j. \end{cases}$$


As in Theorem 2.1, we deal with potential measurability issues concerning
the sup in the proof. Our theorem is really that we can find a random
variable that goes to 0 with probability 1 and dominates the random element
$\sup_{f \in \mathcal{F}_{C_1, I_p(\eta)}} \|M_f - \widetilde{M}_f\|_F$, which is an outer-probability statement.

This theorem generalizes Theorem 2.1 in two ways. The "spherical" case, 
detailed in Corollary 2.3, is a more general version of Theorem 2.1 limited 
to Gaussian noise. This is because the Gaussian setting corresponds to $b = 2$
and $c_0 = 1/(2|||\Sigma_p|||_2)$. However, assuming "only" concentration inequalities
allows us to handle much more complicated structures for the noise
distribution. Some examples are given below. We also note that if the $Y_i$'s (i.e., the
signal part of the $X_i$'s) are sampled, for instance, from a fixed manifold of
finite Euclidean diameter, the conditions on $\mathcal{M}$ are automatically satisfied,
with $\mathcal{M}_p$ being the squared Euclidean diameter of the corresponding manifold.

Another generalization is "geometric": by allowing Ri to vary with i, we 
move away from the spherical geometry of high-dimensional Gaussian vec- 
tors (and generalizations), to a more "elliptical" setting. Hence, our results 
show clearly the potential limitations and the structural assumptions that 
are made when one assumes Gaussianity of the noise. Theorem 2.2 and 
Corollary 2.3 show that the Gaussian-like results of Theorem 2.1 are not ro- 
bust against a change in the geometry of the noise. We note however that if 
$R_i$ is independent of $Z_i$ and $E(R_i^2) = 1$, $\operatorname{cov}(R_iZ_i) = \operatorname{cov}(Z_i)$, so all the noise
models have the same covariance but they may yield different approximating 
matrices and hence different spectral behavior for our information + noise 
models. 

However, the spherical results have the advantage of having simple
interpretations. In the setting of Corollary 2.3, if we assume that $f(0)$ and $f(2\nu)$
are uniformly bounded (in $n$) over the class of functions we consider, we can
replace the diagonal of $\widetilde{M}_f$ by $f(2\nu)/n$ and have the same approximation
results. Then the "new" $\widetilde{M}_f$ is a kernel matrix computed from the signal
part of the data with the new kernel $f_\nu(x) = f(x + 2\nu)$.
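
As a worked instance of the shifted kernel (a sketch, assuming the Gaussian kernel is parametrized as $f(x) = \exp(-x/(2s^2))$ applied to squared distances, as in the Introduction):

```latex
f_\nu(x) = f(x + 2\nu) = \exp\!\left(-\frac{x + 2\nu}{2s^2}\right) = e^{-\nu/s^2}\, f(x).
```

Under this parametrization the new kernel is a constant multiple of the original one, so the approximating matrix built with $f_\nu$ is the pure-signal Gaussian kernel matrix multiplied by $e^{-\nu/s^2}$.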

To make our result more concrete, we give a few examples of distributions 
for which the concentration assumptions on Zi are satisfied: 
• Gaussian random variables, for which we have $c_0 = 1/(2|||\Sigma|||_2)$. We refer
to Ledoux [(2001), Theorem 2.7] for a justification of this claim.

• Vectors of the type $\sqrt{p}\,v$ where $v$ is uniformly distributed on the unit
($\ell_2$-)sphere in dimension $p$. Theorem 2.3 in Ledoux (2001) shows that
our assumptions are satisfied, with $c(p) = (1 - 1/p)/2 \ge c_0 = 1/4$, after
noticing that a 1-Lipschitz function with respect to Euclidean norm is
also 1-Lipschitz with respect to the geodesic distance on the sphere.

• Vectors $\Gamma\sqrt{p}\,v$, with $v$ uniformly distributed on the unit ($\ell_2$-)sphere in $\mathbb{R}^p$
and with $\Gamma\Gamma' = \Sigma$ having bounded operator norm.

• Vectors of the type $p^{1/b}v$, $1 \le b \le 2$, where $v$ is uniformly distributed in
the unit $\ell_b$ ball or sphere in $\mathbb{R}^p$. (See Ledoux [(2001), Theorem 4.21] which
refers to Schechtman and Zinn (2000) as the source of the theorem.) In
this case, $c_0$ depends only on $b$.

• Vectors with log-concave density of the type $e^{-U(x)}$, with the Hessian of
$U$ satisfying, for all $x$, $\operatorname{Hess}(U) \ge 2c_0\mathrm{Id}_p$, where $c_0 > 0$ is the real that
appears in our assumptions. See Ledoux [(2001), Theorem 2.7] for a
justification.

• Vectors $v$ distributed according to a (centered) Gaussian copula, with
corresponding correlation matrix, $\Sigma$, having $|||\Sigma|||_2$ bounded. We refer to
El Karoui (2009) for a justification of the fact that our assumptions are
satisfied. [If $\tilde{v}$ has a Gaussian copula distribution, then its $i$th entry
satisfies $\tilde{v}_i = \Phi(N_i)$, where $N$ is multivariate normal with covariance matrix
$\Sigma$, $\Sigma$ being a correlation matrix, that is, its diagonal is 1. Here $\Phi$ is the
cumulative distribution function of a standard normal distribution. Taking
$v = \tilde{v} - 1/2$ gives a centered Gaussian copula.] This last example is
intended to show that the result can handle quite complicated and nonlinear
noise structure.

We note that to justify that the assumptions of the theorem are satisfied, it 
is enough to be able to show concentration around the mean or the median, 
as Proposition 1.8 in Ledoux (2001) makes clear. 

The reader might feel that the assumptions concerning the boundedness 
of the $R_i$'s will be limiting in practice. We note that the same proof
essentially goes through if we just require that the $|R_i|$'s belong to the interval
$[r_\infty(p), R_\infty(p)]$ with probability going to 1, but this requires a little bit more
conditioning and we leave the details, which are not difficult, to the
interested reader. So for instance, if we had a tail condition on $|R_i|$, we could
bound $\max_i|R_i|$ with high probability to get a choice of $R_\infty(p)$. So this
boundedness condition is here just to make the exposition simpler and is
not particularly limiting in our opinion. On the other hand, we note that
our conditions allow dependence in the $R_i$'s and are therefore rather weak
requirements. 

Finally, the theorem as stated is for a fixed $C_1$, though the class of
functions we are considering might vary with $n$ and $p$ through the influence of
$I_p(\eta)$. The proof makes clear that $C_1$ could also vary with $n$ and $p$. We
discuss in more detail the necessary adjustments after the proof.

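Proof of Theorem 2.2. We use the notation $\mathcal{Y}_n = \{Y_i\}_{i=1}^n$ and $P_{\mathcal{Y}_n}$
to denote probability conditional on $\mathcal{Y}_n$. We call $\mathcal{L} = \{\mathcal{Y}_n : \mathcal{M}(\mathcal{Y}_n) \le \mathcal{M}_p\}$.

Let us also call $\mathcal{YR}_n = \{\{Y_i\}_{i=1}^n, \{R_i\}_{i=1}^n\}$; similarly, $P_{\mathcal{YR}_n}$ denotes
probability conditional on $\mathcal{YR}_n$. We call $\mathcal{LR} = \{\mathcal{YR}_n : \mathcal{Y}_n \in \mathcal{L}\}$. We will start by
working conditionally on $\mathcal{YR}_n$ and eventually decondition our results.

We assume from now on that the $\mathcal{YR}_n$ we work with is such that $\mathcal{Y}_n \in \mathcal{L}$.
Note that $P(\mathcal{Y}_n \in \mathcal{L}) \to 1$ by assumption and also $P(\mathcal{YR}_n \in \mathcal{LR}) \to 1$.

The main idea now is that, in a strong sense,

$$\|X_i - X_j\|_2^2 \simeq \|Y_i - Y_j\|_2^2 + \nu(R_i^2 + R_j^2),$$

where $\nu = E(\|Z_i\|_2^2)/p$. To show this formally, we write

$$\|X_i - X_j\|_2^2 - [\|Y_i - Y_j\|_2^2 + \nu(R_i^2 + R_j^2)] = 2\alpha_{i,j} + \beta_{i,j},$$

where

$$\alpha_{i,j} = \frac{(R_iZ_i - R_jZ_j)'(Y_i - Y_j)}{\sqrt{p}}$$

and

$$\beta_{i,j} = \frac{\|R_iZ_i - R_jZ_j\|_2^2}{p} - \nu(R_i^2 + R_j^2).$$

Our aim is to show that, as $n$ and $p$ tend to infinity,

$$\max_{i \neq j}|\alpha_{i,j}| + \max_{i \neq j}|\beta_{i,j}| \to 0 \quad \text{in probability}.$$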

- On maxi^j\aij\. 
Note that if z = j, Oij = 0. Clearly, 

PynA\ai,,\ > 2r) < Pyn,^ (^|i?,|M]l_M > , 

+ ^':v7^„^|i^,l >r 

Since we assumed that \Ri\ < Roo{p), we see that the function Fij{Z) = 
RiZ'iYi — Yj)/ yfp is Lipschitz (with respect to Euclidean norm), with Lip- 
schitz constant smaller than {M.pYl'^ R^{p) / when y^ is in C Also, 
since E{Zi) = 0, E(Fjj(Z)|3^7?.„) = 0, where the expectation is conditional 
on yiZn- Hence, our concentration assumptions on Zi imply that 

PynS\R^\\Z'^{y^ " Y,) / ^\ > v) < C eM-co{p"^r /[MI'^ RUp)]f)- 
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Therefore, if we use a simple union bound, we get 



PynJma^\ai,,\>2r) <2Cn^exp{-co{p'/^r/[Ml/^Ro.{p)]n■ 
In particular, if we pick, for e > 0, tq = Roo{p)-Mp^'^ p~^^'^ (logn + (logny )^^^ {2 / 
cq)^^'', we see that 

Pynjnmx\aij\ > 2ro) < 2Cn^ exp{-coip^/\o/[Ml/^ Roo{p)])') 



2Cexp(-2(logn)^) ^0. 



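Since

$$P\left(\max_{i \neq j}|\alpha_{i,j}| > t\right) \le P\left(\max_{i \neq j}|\alpha_{i,j}| > t \text{ and } \mathcal{YR}_n \in \mathcal{LR}\right) + P(\mathcal{YR}_n \notin \mathcal{LR}),$$

and since the latter goes to 0, we have, unconditionally,

$$P\left(\max_{i \neq j}|\alpha_{i,j}| > 2r_0\right) \to 0.$$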
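The arithmetic behind the choice of $r_0$ can be checked numerically. The following is a minimal sketch: the function name `union_bound_tail` and all parameter values are illustrative assumptions, not quantities taken from the paper. It evaluates the union bound at $r_0$ and compares it with the closed form $2C\exp(-2(\log n)^\varepsilon)$.

```python
import math

def union_bound_tail(n, p, M_p, R_inf, c0, C, b, eps):
    """Evaluate the union bound at r_0 and its closed form.

    The arguments are illustrative stand-ins for n, p, M_p, R_inf(p),
    c_0, C, b and epsilon in the proof; none are values from the paper.
    """
    # Upper bound on the Lipschitz constant of Z -> R_i Z'(Y_i - Y_j)/sqrt(p)
    lipschitz = math.sqrt(M_p) * R_inf / math.sqrt(p)
    # r_0 = R_inf M_p^{1/2} p^{-1/2} (log n + (log n)^eps)^{1/b} (2/c_0)^{1/b}
    log_term = math.log(n) + math.log(n) ** eps
    r0 = lipschitz * (log_term * 2.0 / c0) ** (1.0 / b)
    # Union bound over at most n^2 pairs, two tail terms per pair
    bound = 2.0 * C * n ** 2 * math.exp(-c0 * (r0 / lipschitz) ** b)
    # Closed form obtained after simplification
    closed_form = 2.0 * C * math.exp(-2.0 * math.log(n) ** eps)
    return r0, bound, closed_form

r0, bound, closed_form = union_bound_tail(
    n=1000, p=5000, M_p=4.0, R_inf=2.0, c0=0.5, C=2.0, b=2.0, eps=0.5
)
```

The factor $n^2$ from the union bound is exactly cancelled by the $\log n$ part of $r_0$, which is why only the $(\log n)^\varepsilon$ term survives in the exponent.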

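- On $\max_{i \neq j}|\beta_{i,j}|$.

We see that if $A$ and $B$ are vectors in $\mathbb{R}^p$, the map $N_{R_i,R_j}: (A, B) \to \|R_i A - R_j B\|_2$
is $(|R_i| \vee |R_j|)$-Lipschitz on $\mathbb{R}^{2p}$ equipped with the norm
$\|A\|_2 + \|B\|_2$, by the triangle inequality. Therefore, using Propositions 1.11
and 1.7 in Ledoux (2001) [and using the fact that $h(r) \to 0$ as $r \to \infty$ and $h$
is continuous when using the latter], we conclude that

$$P_{\mathcal{YR}_n}\left(\left|\|R_i Z_i - R_j Z_j\|_2 - \mathbf{E}_{\mathcal{YR}_n}(\|R_i Z_i - R_j Z_j\|_2)\right| > r\right) \le 4h\left(\frac{r}{2R_\infty(p)}\right). \tag{4}$$

If now $\gamma_{i,j} = \mathbf{E}(\|R_i Z_i - R_j Z_j\|_2 \,|\, \mathcal{YR}_n)$, and if
$r_1 = 2R_\infty(p)(2/c_0)^{1/b}(\log n + (\log n)^\varepsilon)^{1/b}\, p^{-1/2}$,

$$P_{\mathcal{YR}_n}\left(\max_{i \neq j}\left|\frac{\|R_i Z_i - R_j Z_j\|_2}{\sqrt{p}} - \frac{\gamma_{i,j}}{\sqrt{p}}\right| > r_1\right) \le K\exp(-(\log n)^\varepsilon) \to 0,$$

where $K$ is a constant which does not depend on $\mathcal{YR}_n$. So we conclude that
unconditionally, if

$$\Delta_0 = \max_{i \neq j}\left|\frac{\|R_i Z_i - R_j Z_j\|_2}{\sqrt{p}} - \frac{\gamma_{i,j}}{\sqrt{p}}\right|,$$

then $P(\Delta_0 > r_1) \to 0$.

Note also that under our assumptions, $r_1 \to 0$. Recall that we aim to show
that

$$\Delta_2 = \max_{i \neq j}\left|\frac{\|R_i Z_i - R_j Z_j\|_2^2}{p} - \nu(R_i^2 + R_j^2)\right| \to 0 \quad \text{in probability.}$$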






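Let us first work on

$$\Delta_1 = \max_{i \neq j}\left|\frac{\|R_i Z_i - R_j Z_j\|_2^2}{p} - \frac{\gamma_{i,j}^2}{p}\right|.$$

Using the fact that $a^2 - b^2 = (a - b)(a + b)$, and therefore, $|a^2 - b^2| \le |a - b|(|a - b| + 2|b|)$,
we see that

$$\max_{i \neq j}|a_{i,j}^2 - b_{i,j}^2| \le \max_{i \neq j}|a_{i,j} - b_{i,j}|\left(\max_{i \neq j}|a_{i,j} - b_{i,j}| + 2\max_{i \neq j}|b_{i,j}|\right).$$

If we choose $a_{i,j} = \|R_i Z_i - R_j Z_j\|_2/\sqrt{p}$ and $b_{i,j} = \gamma_{i,j}/\sqrt{p}$, we see that the
previous equation becomes

$$\Delta_1 \le \Delta_0\left(\Delta_0 + 2\max_{i \neq j}\frac{\gamma_{i,j}}{\sqrt{p}}\right).$$

Therefore, if we can show that $\Delta_0 \max_{i \neq j} \gamma_{i,j}/\sqrt{p}$ goes to 0 in probability,
we will have $\Delta_1 \to 0$ in probability. Using the concentration result given
in (4), in connection with Proposition 1.9 in Ledoux (2001) and a slight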
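modification explained in El Karoui (2010), we have

$$0 \le (R_i^2 + R_j^2)\nu - \frac{\gamma_{i,j}^2}{p} = \mathrm{var}_{\mathcal{YR}_n}\left(\frac{\|R_i Z_i - R_j Z_j\|_2}{\sqrt{p}}\right) \le R_\infty^2(p)\frac{\kappa_b}{p}, \tag{5}$$

where $\kappa_b$ is a constant that depends only on $C$, $c_0$ and $b$.

Using our assumption that $\nu$ remains bounded, we see that

$$\frac{1}{R_\infty(p)}\max_{i \neq j}\frac{\gamma_{i,j}}{\sqrt{p}}$$

remains bounded.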

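Therefore, for some $K$ independent of $p$,

$$\max_{i \neq j}\frac{\gamma_{i,j}}{\sqrt{p}}\,\Delta_0 \le K R_\infty(p)\, r_1,$$

with probability going to 1. Our assumptions also guarantee that $R_\infty(p)\, r_1 \to 0$,
so we conclude that, for a constant $K$ independent of $p$,

$$\Delta_1 = \max_{i \neq j}\left|\frac{\|R_i Z_i - R_j Z_j\|_2^2}{p} - \frac{\gamma_{i,j}^2}{p}\right| \le K r_1 R_\infty(p) \to 0$$

with probability going to 1.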



Using (5), we have the deterministic inequality

$$\max_{i \neq j}\left|(R_i^2 + R_j^2)\nu - \frac{\gamma_{i,j}^2}{p}\right| \le R_\infty^2(p)\frac{\kappa_b}{p} \le r_1^2 \le r_1.$$
So we can finally conclude that with high probability

$$\Delta_2 = \max_{i \neq j}|\beta_{i,j}| = \max_{i \neq j}\left|\frac{\|R_i Z_i - R_j Z_j\|_2^2}{p} - \nu(R_i^2 + R_j^2)\right| \le K r_1 R_\infty(p) \to 0.$$

Putting all these elements together, we see that when

$$u_p = \frac{(\mathcal{M}_p^{1/2} \vee R_\infty(p))\, R_\infty(p)\, (\log n + (\log n)^\varepsilon)^{1/b}}{p^{1/2}},$$

we can find a constant $K$ such that

$$P\left(\max_{i \neq j}|2\alpha_{i,j} + \beta_{i,j}| > K u_p\right) \to 0.$$

In other words,

$$P\left(\max_{i \neq j}\left|\|X_i - X_j\|_2^2 - [\|Y_i - Y_j\|_2^2 + \nu(R_i^2 + R_j^2)]\right| > K u_p\right) \to 0. \tag{6}$$

This establishes (a strong form of) the first part of the theorem, that is, (2).
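The approximation in (6) is easy to observe in simulation. The sketch below is only an illustration under assumptions that are not in the paper: a signal supported on the first `d` coordinates, Gaussian noise (so that $\nu = 1$), and scale factors $R_i$ drawn uniformly in $[1, 2]$. The helper names are hypothetical. It returns the largest off-diagonal gap between $\|X_i - X_j\|_2^2$ and $\|Y_i - Y_j\|_2^2 + \nu(R_i^2 + R_j^2)$.

```python
import numpy as np

def squared_distances(A):
    """Matrix of squared Euclidean distances between the rows of A."""
    sq = np.sum(A * A, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (A @ A.T)
    return np.maximum(D, 0.0)  # clip tiny negative rounding errors

def max_distance_error(n, p, d=3, seed=0):
    """Largest off-diagonal error of the distance approximation (assumed setup)."""
    rng = np.random.default_rng(seed)
    Y = np.zeros((n, p))
    Y[:, :d] = rng.uniform(-1.0, 1.0, size=(n, d))  # low-dimensional signal
    R = rng.uniform(1.0, 2.0, size=n)               # bounded scale factors R_i
    Z = rng.standard_normal((n, p))                 # noise with nu = E||Z_i||^2 / p = 1
    nu = 1.0
    X = Y + R[:, None] * Z / np.sqrt(p)             # information + noise data
    target = squared_distances(Y) + nu * (R[:, None] ** 2 + R[None, :] ** 2)
    err = np.abs(squared_distances(X) - target)
    off_diagonal = ~np.eye(n, dtype=bool)           # the approximation is for i != j
    return float(err[off_diagonal].max())

errors = {p: max_distance_error(n=50, p=p) for p in (100, 10_000)}
```

In this setup the maximal error shrinks as $p$ grows with $n$ fixed, in line with the rate $u_p$, which decays like $p^{-1/2}$ up to logarithmic factors.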

- Second part of the theorem [equation (3)]. To get to the second part,
we recall that, assuming that $f$ is $C_1$-Lipschitz on an interval containing
$\{\|X_i - X_j\|_2^2,\ \|Y_i - Y_j\|_2^2 + \nu(R_i^2 + R_j^2)\}$, we have

$$|f(\|X_i - X_j\|_2^2) - f(\|Y_i - Y_j\|_2^2 + \nu(R_i^2 + R_j^2))| \le C_1\left|\|X_i - X_j\|_2^2 - (\|Y_i - Y_j\|_2^2 + \nu(R_i^2 + R_j^2))\right|.$$

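Let us define, for $\eta > 0$ given, the event

$$E = \{\forall i \neq j,\ \|X_i - X_j\|_2^2 \in I_p(\eta),\ \|Y_i - Y_j\|_2^2 \in [W_p, \mathcal{M}_p]\},$$

and the random element

$$\zeta_n = \sup_{f \in \mathcal{F}_{C_1, I_p(\eta)}} \max_{i \neq j}|f(\|X_i - X_j\|_2^2) - f(\|Y_i - Y_j\|_2^2 + \nu(R_i^2 + R_j^2))|.$$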

When $E$ is true, all the pairs $(\|X_i - X_j\|_2^2,\ \|Y_i - Y_j\|_2^2 + \nu(R_i^2 + R_j^2))$ are
in $I_p(\eta)$: the part concerning $\|Y_i - Y_j\|_2^2 + \nu(R_i^2 + R_j^2)$ is obvious, and the
one concerning $\|X_i - X_j\|_2^2$ comes from the definition of $E$. So when $E$ is
true, we also have

$$\forall i \neq j, \quad |f(\|X_i - X_j\|_2^2) - f(\|Y_i - Y_j\|_2^2 + \nu(R_i^2 + R_j^2))| \le C_1|2\alpha_{i,j} + \beta_{i,j}|.$$

Let us now consider the random variable $\tau_n$ such that $\tau_n = C_1$ on $E$ and $\infty$
otherwise, so $\tau_n = C_1 1_E + \infty 1_{E^c}$. Our remark above shows that

$$\zeta_n \le \tau_n \max_{i \neq j}|2\alpha_{i,j} + \beta_{i,j}|.$$

Now, we see from our assumptions about $\{Y_i\}_{i=1}^n$, (6) and the fact that
$u_p \to 0$, that for any $\eta > 0$, $P(E) \to 1$. So we have

$$P(\tau_n \le C_1) \to 1.$$

Also, $\max_{i \neq j}|2\alpha_{i,j} + \beta_{i,j}| \le K u_p$ with probability tending to 1, so we can
conclude that

$$P\left(\tau_n \max_{i \neq j}|2\alpha_{i,j} + \beta_{i,j}| \le C_1 K u_p\right) \to 1.$$

Hence, we also have

$$P^*(\zeta_n \le C_1 K u_p) \to 1,$$

where this statement might have to be understood in terms of outer
probabilities, hence the $P^*$ instead of $P$. [See van der Vaart (1998), page
258. In plain English, we have found a random variable, $\tau_n \max_{i \neq j}|2\alpha_{i,j} + \beta_{i,j}|$,
bounded by $C_1 K u_p$ with probability going to 1, which is larger than
the random element $\zeta_n$.]

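In other respects, we have, for all $f \in \mathcal{F}_{C_1, I_p(\eta)}$,

$$\|M_f - \widetilde{M}_f\|_F^2 \le \zeta_n^2,$$

since

$$\max_{i \neq j}|M_f(i,j) - \widetilde{M}_f(i,j)| \le \frac{1}{n}\max_{i \neq j}|f(\|X_i - X_j\|_2^2) - f(\|Y_i - Y_j\|_2^2 + \nu(R_i^2 + R_j^2))| \le \frac{\zeta_n}{n}.$$

Therefore,

$$\sup_{f \in \mathcal{F}_{C_1, I_p(\eta)}} \|M_f - \widetilde{M}_f\|_F \le \zeta_n \to 0 \quad \text{in probability,} \tag{7}$$

where once again this statement may have to be understood in terms of
outer probabilities. The result stated in (3) is proved. □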

We mentioned before the proof the possibility that we might let $C_1$ vary
with $n$ and $p$ and still get a good approximation result. This can be done by
looking at (7) above: $\zeta_n$ is less than $K C_1 u_p$ with high probability, so when
$u_p C_1(n) \to 0$ the main approximation result of Theorem 2.2 holds, for a $C_1$,
and therefore a class of functions, that vary with $n$ (and $p$).

2.2.2. The case of inner-product kernel random matrices. We now turn
our attention to kernel matrices of the form $M(i,j) = f(X_i'X_j)/n$, which are
also of interest in practice. In that setting, we are able to obtain results
similar in flavor to Theorem 2.2, with slight modifications on the assumptions
we make about $f$.



Theorem 2.4 (Scalar product kernels). Suppose we observe data $X_1, \ldots, X_n$
in $\mathbb{R}^p$, with

$$X_i = Y_i + R_i\frac{Z_i}{\sqrt{p}}.$$

We place ourselves in the high-dimensional setting where $n$ and $p$ tend to
infinity. We assume that $\{Y_i\}_{i=1}^n \sim \mathcal{P}_n$.

$\{Z_i\}_{i=1}^n$ are i.i.d. with $\mathbf{E}(Z_i) = 0$, and we also assume that $\{Y_i\}_{i=1}^n$ and
$\{Z_i\}_{i=1}^n$ are independent.

$\{R_i\}_{i=1}^n$ are assumed to be independent of $\{Z_i\}_{i=1}^n$. We also assume that
we can find a deterministic sequence $R_\infty(p)$ such that $\forall i$, $|R_i| \le R_\infty(p)$ and
$R_\infty(p) \ge 1$.

We assume that the distribution of $Z_i$ is such that for any 1-Lipschitz function
$F$ (with respect to Euclidean norm), if $\mu_F = \mathbf{E}(F(Z_i))$,

$$P(|F(Z_i) - \mu_F| > r) \le C\exp(-c_0 r^b) \triangleq h(r),$$

where for simplicity we assume that $c_0$, $C$ and $b$ are independent of $p$. We
call $\nu = \mathbf{E}(\|Z_i\|_2^2)/p$ and assume that $\nu$ stays bounded as $p \to \infty$.

We call $\mathcal{M} = \max_{i,j}|Y_i'Y_j|$, and $\mathcal{M}_p$ a real such that $P(\mathcal{M} \le \mathcal{M}_p) \to 1$.
We assume that there exists $\varepsilon > 0$ such that

$$\max(\sqrt{\mathcal{M}_p}, R_\infty(p))\, R_\infty(p)\, \frac{(\log n + (\log n)^\varepsilon)^{1/b}}{\sqrt{p}} \to 0.$$

We then have

$$\max_{i,j}|X_i'X_j - (Y_i'Y_j + \delta_{i,j}\nu R_i^2)| \to 0 \quad \text{in probability.} \tag{8}$$

We call $J_p(\eta) = [-\mathcal{M}_p - \eta - R_\infty^2(p)\nu,\ \mathcal{M}_p + \eta + R_\infty^2(p)\nu]$ and

$$\mathcal{F}_{C_1, J_p(\eta)} = \{f \text{ such that, for all } x, y \in J_p(\eta),\ |f(x) - f(y)| \le C_1|x - y|\}.$$

We then consider the random matrices $M_f$ with $(i,j)$ entry

$$M_f(i,j) = \frac{1}{n}f(X_i'X_j) \quad \text{for } f \in \mathcal{F}_{C_1, J_p(\eta)}.$$

Let us call $\widetilde{M}_f$ the matrix with $(i,j)$th entry

$$\widetilde{M}_f(i,j) = \begin{cases} \frac{1}{n}f(Y_i'Y_j), & \text{if } i \neq j, \\ \frac{1}{n}f(\|Y_i\|_2^2 + \nu R_i^2), & \text{if } i = j. \end{cases}$$

We have, for any $C_1 > 0$ and $\eta > 0$,

$$\lim_{n,p \to \infty}\ \sup_{f \in \mathcal{F}_{C_1, J_p(\eta)}} \|M_f - \widetilde{M}_f\|_F = 0 \quad \text{in probability.}$$

We note that under our assumptions, we also have $|f(\|Y_i\|_2^2 + \nu R_i^2) - f(\|Y_i\|_2^2)| \le \nu C_1 R_\infty^2(p)$,
with high probability, and uniformly in $f$ in $\mathcal{F}_{C_1, J_p(\eta)}$.
Therefore, when $R_\infty^2(p)/\sqrt{n} \to 0$, the result is also valid if we replace the diagonal
of $\widetilde{M}_f$ by $\{f(\|Y_i\|_2^2)\}_{i=1}^n/n$, in which case the new approximating matrix
is the kernel matrix computed from the signal part of the data. Furthermore,
the same argument shows that we get a valid operator norm approximation
of $M$ by this "pure signal" matrix as soon as $R_\infty^2(p)/n$ tends to 0.

The same measurability issues as in the previous theorems might arise
here and the statement should be understood as before: we can find a random
variable going to 0 in probability that is larger than the random element
$\sup_{f \in \mathcal{F}_{C_1, J_p(\eta)}} \|M_f - \widetilde{M}_f\|_F$.

Finally, let us note that once again the theorem is stated for a fixed $C_1$
[and hence for an essentially fixed (with $n$) class of functions, though some
changes in this class might come from varying $J_p(\eta)$], but the proof allows
us to deal with a varying $C_1(n)$. The adjustments are very similar to the
ones we discussed after the proof of Theorem 2.2 and we leave them to the
interested reader.
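As an illustration of Theorem 2.4, the following sketch compares the inner-product kernel matrix built from noisy data with the approximating matrix built from the signal, in Frobenius norm. The setup is an assumption made for the example and is not taken from the paper: the kernel is `np.tanh` (which is 1-Lipschitz), the signal lives on the first `d` coordinates, the noise is Gaussian (so $\nu = 1$), and the function name is hypothetical.

```python
import numpy as np

def inner_product_kernel_gap(n, p, f=np.tanh, d=3, seed=0):
    """Frobenius distance between M_f and its approximation (assumed setup)."""
    rng = np.random.default_rng(seed)
    Y = np.zeros((n, p))
    Y[:, :d] = rng.uniform(-1.0, 1.0, size=(n, d))  # low-dimensional signal
    R = rng.uniform(1.0, 2.0, size=n)               # bounded scale factors R_i
    Z = rng.standard_normal((n, p))                 # noise with nu = 1
    nu = 1.0
    X = Y + R[:, None] * Z / np.sqrt(p)
    # Kernel matrix computed from the observed data
    M = f(X @ X.T) / n
    # Approximating matrix: signal inner products, with nu * R_i^2 added
    # on the diagonal only (the delta_{i,j} term of equation (8))
    M_tilde = f(Y @ Y.T + nu * np.diag(R ** 2)) / n
    return float(np.linalg.norm(M - M_tilde, ord="fro"))

gaps = {p: inner_product_kernel_gap(n=50, p=p) for p in (100, 10_000)}
```

Off the diagonal the argument of the kernel is left unchanged by the noise, which is the reason dot-product kernels behave differently from distance kernels in this setting.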

Proof of Theorem 2.4. The proof is quite similar to that of Theorem
2.2, so we mostly outline the differences and use the same notation as before.
We now have to focus on

$$X_i'X_j = Y_i'Y_j + R_i\frac{Z_i'Y_j}{\sqrt{p}} + R_j\frac{Z_j'Y_i}{\sqrt{p}} + R_i R_j\frac{Z_i'Z_j}{p}.$$

The analysis of $R_i Z_i'Y_j/\sqrt{p}$ is entirely similar to our analysis of $\alpha_{i,j}$ in the proof of
Theorem 2.2. The key remark now is that as a function of $Z_i$, when $\mathcal{YR}_n \in \mathcal{LR}$,
it is, with the new definition of $\mathcal{M}_p$, $R_\infty(p)\sqrt{\mathcal{M}_p/p}$-Lipschitz with respect to
Euclidean norm. So we immediately have, with the new definition of $\mathcal{M}_p$: if
$r_0 = R_\infty(p)(\mathcal{M}_p/p)^{1/2}(\log n + (\log n)^\varepsilon)^{1/b}(2/c_0)^{1/b}$, and $\mathcal{YR}_n \in \mathcal{LR}$, for some
$K > 0$ which does not depend on $\mathcal{YR}_n$,

$$P_{\mathcal{YR}_n}\left(\max_{i,j}\left|R_i\frac{Z_i'Y_j}{\sqrt{p}}\right| > r_0\right) \le K\exp(-2(\log n)^\varepsilon).$$

Now, since $P(\mathcal{YR}_n \notin \mathcal{LR}) \to 0$, we conclude as before that

$$P\left(\max_{i,j}\left|R_i\frac{Z_i'Y_j}{\sqrt{p}}\right| > r_0\right) \to 0.$$



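On the other hand, using the fact that $4R_i R_j Z_i'Z_j = \|R_i Z_i + R_j Z_j\|_2^2 - \|R_i Z_i - R_j Z_j\|_2^2$,
and analyzing the concentration properties of $\|R_i Z_i + R_j Z_j\|_2$
in the same way as we did those of $\|R_i Z_i - R_j Z_j\|_2$, we conclude
that if $u_p = R_\infty^2(p)(2/c_0)^{1/b}(\log n + (\log n)^\varepsilon)^{1/b}\, p^{-1/2}$, we can find a constant
$K$ such that

$$P\left(\max_{i \neq j}\left|\frac{\|R_i Z_i - R_j Z_j\|_2^2}{p} - \nu(R_i^2 + R_j^2)\right| > K u_p\right) \to 0$$

and

$$P\left(\max_{i \neq j}\left|\frac{\|R_i Z_i + R_j Z_j\|_2^2}{p} - \nu(R_i^2 + R_j^2)\right| > K u_p\right) \to 0.$$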



Similar arguments, relying on the fact that $\|\cdot\|_2$ is obviously 1-Lipschitz
with respect to Euclidean norm, also lead to the fact that

$$P\left(\max_{i}\left|R_i^2\frac{\|Z_i\|_2^2}{p} - \nu R_i^2\right| > K u_p\right) \to 0.$$

Therefore, we can find $K$, greater than 1 without loss of generality, such
that

$$P\left(\max_{i,j}\left|R_i R_j\frac{Z_i'Z_j}{p} - \delta_{i,j}\nu R_i^2\right| > K u_p\right) \to 0.$$



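We can therefore conclude that

$$P\left(\max_{i,j}|X_i'X_j - (Y_i'Y_j + \delta_{i,j}\nu R_i^2)| > K u_p + 2r_0\right) \to 0.$$

If $R_\infty(p)\max((\mathcal{M}_p)^{1/2}, R_\infty(p))(\log n + (\log n)^\varepsilon)^{1/b}/\sqrt{p} \to 0$, then both $r_0$
and $u_p$ tend to 0. Therefore, under our assumptions,

$$\max_{i,j}|X_i'X_j - (Y_i'Y_j + \delta_{i,j}\nu R_i^2)| \to 0 \quad \text{in probability.}$$

So we have shown the first assertion of the theorem.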

The final step of the proof is now clear: we have, for all $(i, j)$,

$$|f(X_i'X_j) - f(Y_i'Y_j + \delta_{i,j}\nu R_i^2)| \le C_1|X_i'X_j - [Y_i'Y_j + \delta_{i,j}\nu R_i^2]|,$$

when, for all $(i, j)$, $X_i'X_j$ and $(Y_i'Y_j + \delta_{i,j}\nu R_i^2)$ are in $J_p(\eta)$. This event
happens with probability going to 1 under our assumptions. So following the
same approach as before and dealing with measurability in the same way,
we have, with probability going to 1,

$$\sup_{f \in \mathcal{F}_{C_1, J_p(\eta)}} \max_{i,j}|f(X_i'X_j) - f(Y_i'Y_j + \delta_{i,j}\nu R_i^2)| \le C_1\max_{i,j}|X_i'X_j - (Y_i'Y_j + \delta_{i,j}\nu R_i^2)|.$$

So we conclude that

$$\sup_{f \in \mathcal{F}_{C_1, J_p(\eta)}} \max_{i,j}|f(X_i'X_j) - f(Y_i'Y_j + \delta_{i,j}\nu R_i^2)| \to 0 \quad \text{in probability.}$$

From this statement, we get in the same manner as before,

$$\sup_{f \in \mathcal{F}_{C_1, J_p(\eta)}} \|M_f - \widetilde{M}_f\|_F \to 0 \quad \text{in probability.}$$

As before, the equations above show that if $C_1(n)(u_p + r_0) \to 0$, the same
approximation result holds, now with a varying $C_1(n)$.

2.3. Practical consequences of the results: Case of spherical noise. Our
aim in giving approximation results is naturally to use existing knowledge
concerning the approximating matrix to reach conclusions concerning the
information + noise kernel matrices that are of interest here. In particular,
we have in mind situations where the "signal" part of the data, that is,
what we called $\{Y_i\}_{i=1}^n$ in the theorems, and $f$ [or $f(\cdot + 2\nu)$, with $\nu$ being as
defined in Theorems 2.1 or 2.2] are such that the assumptions of Theorems
3.1 or 5.1 in Koltchinskii and Giné (2000) are satisfied, in which case we can
approximate the eigenvalues of $\widetilde{M}$ by those of the corresponding operator in
$L^2(dP)$. In this setting the matrix $\widetilde{M}$, which is normalized so its entries are
of order $1/n$, has a nondegenerate limit, which is why we considered for our
kernel matrices the normalization $f(\|X_i - X_j\|_2^2)/n$. [This normalization by
$1/n$ makes our proofs considerably simpler than the ones given in El Karoui
(2010).]

Another potentially interesting application is the case where the signal 
part of the data is sampled i.i.d. from a manifold with bounded Euclidean 
diameter, in which case our results are clearly applicable. 

2.3.1. Spectral properties of information + noise kernel random matrices
from pure signal kernel random matrices. The practical interest of the
theorems we obtained above lies in the fact that the Frobenius norm is larger
than the operator norm, and therefore all of our results also hold in
operator norm. Now we recall the discussion in El Karoui [(2008), Section 3.3],
where we explained that consistency in operator norm implies consistency of
eigenvalues and consistency of eigenspaces corresponding to separated
eigenvalues [as consequences of Weyl's inequality and the Davis-Kahan $\sin(\theta)$
theorem; see Bhatia (1997) and Stewart and Sun (1990)].

Theorems 2.1, 2.2, 2.4 therefore imply that under the assumptions stated
there, the spectral properties of the matrix $M$ can be deduced from those
of the matrix $\widetilde{M}$. In particular, for techniques such as kernel PCA, we
expect, when it is a reasonable idea to use that technique, that $M$ will have
some separated eigenvalues, that is, a few will be large and there will be a
gap in the spectrum. In that setting, it is enough to understand $\widetilde{M}$, which
corresponds, if $\forall i$, $R_i = 1$, to a pure signal matrix, with a possibly slightly
different kernel, to have a theoretical understanding of the properties of the
technique.

For instance, if $\forall i$, $R_i = 1$, if the assumptions underlying the first-order
results of Koltchinskii and Giné (2000) are satisfied for $\widetilde{M}$, the (first-order)
spectral properties of $M$ are the same as those of $\widetilde{M}$, and hence of the
corresponding operator in $L^2(dP)$.

2.3.2. On the Gaussian kernel. Our analysis reveals a very interesting
feature of the Gaussian kernel, that is, the case where $M(i,j) = \exp(-s\|X_i - X_j\|_2^2)/n$,
for some $s > 0$: when Theorem 2.1 or Corollary 2.3 (i.e., Theorem
2.2 with $\forall i$, $R_i = 1$) apply, the eigenspaces corresponding to separated
eigenvalues of the signal + noise kernel matrix converge to those of the pure signal
matrix.

This is simply due to the fact that in that setting, if $S$ is the matrix such
that

$$S(i,j) = \exp(-2\nu s)\,\frac{1}{n}\exp(-s\|Y_i - Y_j\|_2^2),$$

a rescaled version of the "pure signal" matrix $\mathcal{M}$ with $(i,j)$th entry
$\frac{1}{n}\exp(-s\|Y_i - Y_j\|_2^2)$, we have

$$|||S - M|||_2 \to 0.$$

This latter statement is a simple consequence of the fact that $S - \widetilde{M}$ is a
diagonal matrix with entries $(\exp(-2\nu s) - 1)/n$ on the diagonal, and therefore
its operator norm goes to 0, while $|||M - \widetilde{M}|||_2 \to 0$ by the theorems above.
On the other hand, $S$ clearly has the same
eigenvectors as the pure signal matrix $\mathcal{M}$. Hence, because the eigenspaces of
$M$ are consistent for the eigenspaces of $S$ corresponding to separated
eigenvalues, they are also consistent for those of $\mathcal{M}$. (We note that our results
are actually stronger and allow us to deal with a collection of matrices with
varying $s$ and not a single $s$, as we just discussed. This is because we can
deal with approximations over a collection of functions in all our theorems.)
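The behavior described for the Gaussian kernel can be observed in a small simulation. This is a sketch under assumptions that are not in the paper: a two-dimensional signal embedded in $\mathbb{R}^p$, standard Gaussian noise with $R_i = 1$ (so $\nu = 1$), and hypothetical helper names. It compares the leading eigenvector and eigenvalue of the kernel matrix computed from noisy data with those of the pure signal kernel matrix.

```python
import numpy as np

def squared_distances(A):
    """Matrix of squared Euclidean distances between the rows of A."""
    sq = np.sum(A * A, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (A @ A.T)
    return np.maximum(D, 0.0)

def gaussian_kernel_check(n=60, p=5000, d=2, s=0.5, seed=0):
    """Compare noisy and pure-signal Gaussian kernel matrices (assumed setup)."""
    rng = np.random.default_rng(seed)
    Y = np.zeros((n, p))
    Y[:, :d] = rng.uniform(-1.0, 1.0, size=(n, d))  # low-dimensional signal
    Z = rng.standard_normal((n, p))                 # spherical noise, R_i = 1
    nu = 1.0                                        # E||Z_i||^2 / p for this noise
    X = Y + Z / np.sqrt(p)
    M_noisy = np.exp(-s * squared_distances(X)) / n
    M_signal = np.exp(-s * squared_distances(Y)) / n
    # eigh returns eigenvalues in ascending order; the last one is the largest
    vals_noisy, vecs_noisy = np.linalg.eigh(M_noisy)
    vals_signal, vecs_signal = np.linalg.eigh(M_signal)
    # Absolute inner product removes the sign ambiguity of eigenvectors
    alignment = float(abs(vecs_noisy[:, -1] @ vecs_signal[:, -1]))
    ratio = float(vals_noisy[-1] / vals_signal[-1])
    return alignment, ratio, float(np.exp(-2.0 * nu * s))

alignment, ratio, shrink = gaussian_kernel_check()
```

In this setup `alignment` should be close to 1, while `ratio` should be close to the shrinkage factor $\exp(-2\nu s)$ rather than to 1: the eigenvectors are recovered but the large eigenvalues are scaled down.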

Because of the practical importance of eigenspaces in techniques such as
kernel PCA, these remarks can be seen as giving a theoretical justification
for the use of the Gaussian kernel over other kernels in the situations where
we think we might be in an information + noise setting, and the noise is
spherical.

On the other hand, $S$ underestimates the large eigenvalues of $\mathcal{M}$ because
$S = \exp(-2\nu s)\mathcal{M}$, and obviously $\exp(-2\nu s) < 1$. Using Weyl's inequality
[see Bhatia (1997)], we have, if we denote by $\lambda_i(M)$ the $i$th eigenvalue of
the symmetric matrix $M$,

$$\forall i,\ 1 \le i \le n, \quad |\lambda_i(M) - \lambda_i(S)| \le |||M - S|||_2.$$

Since the right-hand side goes to 0 asymptotically, the eigenvalues of $\mathcal{M}$
(the "pure signal" matrix) that stay asymptotically bounded away from 0
are underestimated by the corresponding eigenvalues of $M$.

When the noise is elliptical, that is, the $R_i$'s are not all equal to 1, the "new"
matrix $S$ we have to deal with has entries

$$S(i,j) = \exp(-s\nu R_i^2)\exp(-s\nu R_j^2)\,\frac{1}{n}\exp(-s\|Y_i - Y_j\|_2^2),$$

so it can be written in matrix form

$$S = D\mathcal{M}D,$$

where $D$ is a diagonal matrix with $D(i,i) = \exp(-s\nu R_i^2)$. By the same
arguments as above, $|||S - M|||_2 \to 0$ in probability, but now $S$ does not have
the same eigenvectors as the pure signal matrix $\mathcal{M}$. So in this elliptical
setting, if we were to do kernel analysis on $M$, we would not be recovering the
eigenspaces of the pure signal matrix $\mathcal{M}$.
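The effect of unequal $R_i$ on eigenvectors can be seen directly on the matrix $S$. The sketch below uses made-up signal points and scale factors (assumptions for illustration only) and takes $\nu = 1$ in the exponents. It checks that the entrywise definition of $S$ agrees with the product of the diagonal matrix $D$, the pure signal matrix and $D$, and measures how far the leading eigenvector of $S$ is from that of the pure signal matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s, nu = 40, 2, 0.5, 1.0                 # illustrative values, not from the paper
Y = rng.uniform(-1.0, 1.0, size=(n, d))       # signal points
R = rng.uniform(0.5, 2.0, size=n)             # unequal scales: elliptical noise

# Pure signal Gaussian kernel matrix
sq = np.sum(Y * Y, axis=1)
dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (Y @ Y.T), 0.0)
M_signal = np.exp(-s * dist2) / n

# Entrywise definition of S, with nu included in the exponents
S_entry = np.exp(-s * nu * R[:, None] ** 2) * np.exp(-s * nu * R[None, :] ** 2) * M_signal

# Matrix form: D * M_signal * D with D diagonal
D = np.diag(np.exp(-s * nu * R ** 2))
S_matrix = D @ M_signal @ D

# Leading eigenvectors (eigh sorts eigenvalues in ascending order)
top_signal = np.linalg.eigh(M_signal)[1][:, -1]
top_S = np.linalg.eigh(S_matrix)[1][:, -1]
alignment = float(abs(top_signal @ top_S))
```

With all $R_i$ equal, $D$ is a multiple of the identity and `alignment` would be 1; with spread-out $R_i$ it drops below 1, since conjugating by $D$ reweights the coordinates of the eigenvectors.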

2.3.3. Variants of kernel matrices: Laplacian matrices and the issue of 
centering. In various parts of statistics and machine learning, it has been 
argued that Laplacian matrices should be used instead of kernel matrices. 
See, for instance, the very interesting Belkin and Niyogi (2008), where var- 
ious spectral properties of Laplacian matrices have been studied, under a 
"pure" signal assumption in our terminology. For instance, it is assumed 
that the data is sampled from a fixed-dimensional manifold. In light of the 
theoretical and practical success of these methods, it is natural to ask what 
happens in the information + noise case. 

There are several definitions of Laplacian matrices. A popular one [see,
e.g., the work of Belkin and Niyogi (2008), among other publications], is
derived from kernel matrices: given $M$ a kernel matrix, the Laplacian matrix
is defined as

$$L(i,j) = \begin{cases} -M(i,j), & \text{if } i \neq j, \\ \sum_{j \neq i} M(i,j), & \text{otherwise.} \end{cases}$$

When our Theorems 2.2 or 2.4 apply, we have seen that, for relevant
classes of functions $\mathcal{F}$, $\sup_{f \in \mathcal{F}} n\max_{i \neq j}|M_f(i,j) - \widetilde{M}_f(i,j)| \to 0$ in
probability.

Let us now focus on the case of a single function $f$. If we call $\widetilde{L}$ the
Laplacian matrix corresponding to $\widetilde{M}$, we have

$$n\max_{i \neq j}|L(i,j) - \widetilde{L}(i,j)| \to 0 \quad \text{in probability,}$$

$$\max_{i}|L(i,i) - \widetilde{L}(i,i)| \to 0 \quad \text{in probability.}$$

We conclude that $|||L - \widetilde{L}|||_2 \to 0$ in probability; we can therefore deduce
the spectral properties of the Laplacian matrix $L$ from those of $\widetilde{L}$,
which, when $\forall i$, $R_i = 1$, is a "pure signal" matrix, where we have slightly
adjusted the kernel. Here again, the Gaussian kernel plays a special role,
since when we use a Gaussian kernel, $\widetilde{L}$ is a scaled version of the Laplacian
matrix computed from the signal part of the data.
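The Laplacian construction and the scaling property for the Gaussian kernel can be written out in a few lines. The sketch below uses assumed values throughout: `laplacian` is a hypothetical helper implementing the definition above, the signal points are synthetic, $\nu = 1$, and the diagonal of the approximating matrix is set to $1/n$ as an assumption (it plays no role, since the Laplacian ignores the diagonal of the kernel matrix).

```python
import numpy as np

def laplacian(M):
    """L(i,j) = -M(i,j) for i != j, and L(i,i) = sum over j != i of M(i,j)."""
    off = M - np.diag(np.diag(M))             # drop the diagonal of the kernel matrix
    return np.diag(off.sum(axis=1)) - off

rng = np.random.default_rng(0)
n, d, s, nu = 30, 2, 0.5, 1.0                 # illustrative values, not from the paper
Y = rng.uniform(-1.0, 1.0, size=(n, d))       # signal points
sq = np.sum(Y * Y, axis=1)
dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (Y @ Y.T), 0.0)

# Pure signal Gaussian kernel matrix
M_signal = np.exp(-s * dist2) / n

# Approximating matrix in the spherical case: kernel shifted by 2 * nu
# off the diagonal; the diagonal value is an assumption of this sketch
M_tilde = np.exp(-s * (dist2 + 2.0 * nu)) / n
np.fill_diagonal(M_tilde, 1.0 / n)

L_signal = laplacian(M_signal)
L_tilde = laplacian(M_tilde)
scale = np.exp(-2.0 * nu * s)                 # expected ratio between the two Laplacians
```

Because the shift by $2\nu$ factors out of the exponential, `L_tilde` equals `scale * L_signal` up to rounding, so both matrices share their eigenvectors.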

Finally, other versions of the Laplacian are also used in practice. In
particular, a "normalized" version is sometimes advocated, and computed as
$N_L = D_L^{-1/2} L D_L^{-1/2}$, if $D_L$ is the diagonal of the matrix $L$ defined above. We
have just seen that $|||D_L - D_{\widetilde{L}}|||_2 \to 0$ in probability and $|||L - \widetilde{L}|||_2 \to 0$ in
probability. Therefore, if the entries of $D_{\widetilde{L}}$ are bounded away from 0 with
probability going to 1, we conclude that $|||D_L^{-1}|||_2$ stays bounded with high
probability and

$$|||N_L - N_{\widetilde{L}}|||_2 \to 0 \quad \text{in probability.}$$

So once again, understanding the spectral properties of $N_L$ essentially boils
down to understanding those of $N_{\widetilde{L}}$, which is, in the spherical setting where
$\forall i$, $R_i = 1$, a "pure signal" matrix. In the case of the Gaussian kernel, $N_{\widetilde{L}}$ is
equal to the normalized Laplacian matrix computed from the "pure signal"
data $\{Y_i\}_{i=1}^n$.

The question of centering. In practice, it is often the case that one works
with centered versions of kernel matrices: either the row sums, the column
sums or both are made to be equal to zero. These centering operations amount
to multiplying (resp., on the right, left or both) our original kernel matrix
by the matrix $H = \mathrm{Id}_n - \mathbf{1}\mathbf{1}'/n$, where $\mathbf{1}$ is the $n$-dimensional vector whose
entries are all equal to 1. This matrix has operator norm 1, so when $\widetilde{M}$ is
such that $|||M - \widetilde{M}|||_2 \to 0$, the same is true for $H^a M H^b$ and $H^a \widetilde{M} H^b$, where
$a$ and $b$ are either 0 or 1. This shows that our approximations are therefore
also informative when working with centered kernel matrices.
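The properties of the centering matrix used in this argument can be verified numerically. In the sketch below, `A` is an arbitrary symmetric matrix standing in for a kernel matrix (an assumption for illustration, not data from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 25
A = rng.standard_normal((n, n))
A = (A + A.T) / 2.0                      # stand-in symmetric "kernel" matrix

H = np.eye(n) - np.ones((n, n)) / n      # centering matrix Id_n - 11'/n

op_norm_H = np.linalg.norm(H, 2)         # operator (spectral) norm of H
row_sums = (A @ H).sum(axis=1)           # right-multiplication zeroes the row sums
col_sums = (H @ A).sum(axis=0)           # left-multiplication zeroes the column sums
double_centered = H @ A @ H              # both row and column sums are zero
```

Since $H$ is an orthogonal projection, multiplying by it on either side cannot increase the operator norm, which is why an operator norm approximation of the kernel matrix carries over to its centered versions.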
3. Conclusions. Our results aim to bridge the gap in the existing
literature between the study of kernel random matrices in the presence of pure
low-dimensional signal data [see, e.g., Koltchinskii and Giné (2000)] and
the case of truly high-dimensional data [see El Karoui (2010)]. Our study
of information + noise kernel random matrices shows that, to first order, 
kernel random matrices are somewhat "spectrally robust" to the corrup- 
tion of signal by additive high dimensional and spherical noise (whose norm 
is controlled). In particular, they tend to behave much more like a kernel 
matrix computed from a low-dimensional signal than one computed from 
high-dimensional data. 

Some noteworthy results include the fact that dot-product kernel random
matrices are, under reasonable assumptions on the kernel and the "signal
distribution," spectrally robust for both eigenvalues and eigenvectors. The
Gaussian kernel also yields spectrally robust matrices at the level of eigen- 
vectors, when the noise is spherical. However, it will underestimate separated 
eigenvalues of the Gaussian kernel matrix corresponding to the signal part 
of the data. 

On the other hand, Euclidean distance kernel random matrices are not, in
general, robust to the presence of additive noise. As our results show, under
reasonably minimal assumptions on the noise, the kernel and the signal
distribution, a Euclidean distance kernel random matrix computed from
additively corrupted data behaves like another Euclidean distance kernel
matrix computed from another kernel: in the case of spherical noise, it is a
shifted version of $f$, the shift being twice the norm of the noise. For spherical
noise, this is bound to create (except for the Gaussian kernel) potentially
serious inconsistencies in both estimators of eigenvalues and eigenvectors,
because the eigenproperties of the kernel matrix corresponding to the
function $f_\nu(\cdot) = f(\cdot + 2\nu)$ are in general different from those of the kernel matrix
corresponding to the function $f$. The same remarks apply to the case of
elliptical noise, where the change of kernel is not deterministic and even more
complicated to describe and interpret.

Our study also highlights the importance of the implicit geometric as- 
sumptions that are made about the noise. In particular, the results are 
qualitatively different if the noise is spherical (e.g., multivariate Gaussian) or 
elliptical (e.g., multivariate t). Interpretation is more complicated in the el- 
liptical case and a number of nice properties (e.g., robustness or consistency) 
which hold for spherical noise do not hold for elliptical noise. 

We note that our study suggests that simple practical (and entrywise) 
corrections could be used to go from the "signal + noise" situation to an 
approximation of the "pure signal" situation. Those would naturally depend 
on the noise geometry and what information practitioners have about it. 

Our results can therefore be seen as highlighting (from a theoretical point
of view) the strengths and limitations of techniques which rely on kernel
random matrices as a primary element in a data analysis. We hope they
shed light on an interesting issue and will help refine our understanding
of the behavior of kernel techniques and related methodologies for
high-dimensional input data.